Comparing Clustering Methods in Customer Segmentation

Handling Missing Data And Preprocessing

Before we can do any other preprocessing, we first need to deal with missing data. Included is a helpful graphic created by the YouTuber Data Professor.

Dealing With Missing Data

The first thing to look at is the columns, according to the descriptions available from the source of the dataset, here. The two columns Z_Revenue and Z_CostContact hold the same value in every record, so they add no information and can be dropped. We can confirm this by checking the unique values in those columns before dropping them. We drop them before modifying the dataset further so that they do not get in the way.
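The check described above can be sketched as follows. This is a minimal illustration on a made-up three-row frame, not the real dataset; only the column names Z_Revenue and Z_CostContact come from the text, and the Income values are placeholders.

```python
import pandas as pd

# Toy stand-in for the real dataset (values are made up for illustration).
df = pd.DataFrame({
    "Income": [58138.0, 46344.0, 71613.0],
    "Z_CostContact": [3, 3, 3],
    "Z_Revenue": [11, 11, 11],
})

candidate_cols = ["Z_CostContact", "Z_Revenue"]

# nunique() == 1 means every record holds the same value in that column.
unique_counts = df[candidate_cols].nunique()
print(unique_counts)

# Drop only the columns that were confirmed to be constant.
to_drop = unique_counts[unique_counts == 1].index.tolist()
df = df.drop(columns=to_drop)
```

Checking before dropping keeps the step safe to rerun if the data source ever changes and one of these columns starts to vary.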

Let's first look at the information of the dataset, which gives us a better view of the data we are working with. For any records with missing data, we will just perform a simple listwise deletion.
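Listwise deletion means removing every record that has at least one missing value. A small sketch with pandas, using a hypothetical frame in which one Income value is missing:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one record has no Income value.
df = pd.DataFrame({
    "Income": [58138.0, np.nan, 71613.0, 26646.0],
    "Recency": [58, 38, 26, 26],
})

# Column dtypes and non-null counts, to see where data is missing.
df.info()

# Number of missing values per column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Listwise deletion: drop every row that contains any missing value.
df_clean = df.dropna().reset_index(drop=True)
```

This is reasonable when only a small share of records is affected; if many rows had gaps, imputation would be worth considering instead.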

Let's look at the summary statistics of the data we will be using during the dimensionality reduction.

Now that we have removed the data that is not necessary for the analysis, we can continue preprocessing by turning categorical data into numerical values that can be used. First we will take dt_customer and turn it into an integer describing how long each person has been a customer: 0 for the newest customer, and for everyone else the number of days between their enrollment date and that of the newest customer.
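A sketch of that conversion is shown here. The dates, the ISO date format, and the new column name days_as_customer are assumptions made for illustration; the real file may store dates in a different format, in which case the format argument needs to change.

```python
import pandas as pd

# Hypothetical enrollment dates; the ISO format is an assumption.
df = pd.DataFrame({"dt_customer": ["2012-09-04", "2014-03-08", "2013-08-21"]})

# Parse the strings into real timestamps.
dates = pd.to_datetime(df["dt_customer"], format="%Y-%m-%d")

# Days between each enrollment date and the most recent one,
# so the newest customer gets 0 and older customers get larger values.
df["days_as_customer"] = (dates.max() - dates).dt.days
print(df)
```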

Let's see the summary statistics of the data, which will let us spot any unusual values such as outliers.

We can also graph the data to see if there are any other outliers.

We can see there are outliers in Income and Age, so we can remove them through listwise deletion.
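The text does not say which rule was used to flag outliers. The sketch below assumes the common 1.5 × IQR rule and uses made-up values; the helper name iqr_mask is hypothetical. Any record that falls outside the fences in either column is removed as a whole row.

```python
import pandas as pd

# Made-up values with one implausible record in the last row.
df = pd.DataFrame({
    "Income": [30000, 42000, 51000, 58000, 64000, 72000, 666666],
    "Age": [28, 35, 41, 47, 52, 60, 121],
})

def iqr_mask(series, k=1.5):
    """Return True for values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return series.between(q1 - k * iqr, q3 + k * iqr)

# Keep a row only if it is inside the fences for both columns.
keep = iqr_mask(df["Income"]) & iqr_mask(df["Age"])
df_no_outliers = df[keep].reset_index(drop=True)
print(df_no_outliers)
```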

We can then scale the data so that features with large numeric ranges, such as Income, do not dominate the distance calculations the clustering algorithms rely on.
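The scaler used is not named in the text. As one possible approach, scikit-learn's StandardScaler rescales each column to zero mean and unit variance; the input array here is a made-up example with an income-like and an age-like column.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrix: column 0 is income-like, column 1 is age-like.
X = np.array([
    [30000.0, 28.0],
    [58000.0, 47.0],
    [72000.0, 60.0],
    [51000.0, 41.0],
])

# Standardize each column to mean 0 and standard deviation 1.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled)
```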

Dimensionality Reduction

Perform PCA to reduce the data to 3 columns.
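A minimal sketch of this step with scikit-learn, using random data as a stand-in for the scaled feature matrix (the shape of 200 rows by 8 features is arbitrary):

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in for the scaled customer features.
rng = np.random.default_rng(0)
X_scaled = rng.normal(size=(200, 8))

# Project the data onto the first three principal components.
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_scaled)

# Share of the total variance captured by each retained component.
print(pca.explained_variance_ratio_)
```

Inspecting explained_variance_ratio_ is a quick way to judge how much information the three retained columns preserve.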

Let's use the elbow method (knee point detection) to decide how many clusters we should have.
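The elbow method fits k-means for a range of cluster counts and looks for the point where the inertia stops dropping sharply. The sketch below uses synthetic blobs and a simple hand-rolled knee heuristic (the point furthest below the straight line joining the ends of the normalized curve). This heuristic is an assumption for illustration and is not necessarily the knee detector used in the original analysis.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the three PCA columns.
X, _ = make_blobs(n_samples=300, centers=4, n_features=3, random_state=0)

# Fit k-means for k = 1..10 and record the inertia of each fit.
ks = list(range(1, 11))
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    for k in ks
]

# Normalize both axes to [0, 1] so the chord runs from (0, 1) to (1, 0).
k_arr = np.array(ks, dtype=float)
y = np.array(inertias)
k_norm = (k_arr - k_arr[0]) / (k_arr[-1] - k_arr[0])
y_norm = (y - y.min()) / (y.max() - y.min())

# Knee heuristic: the k whose point lies furthest below the chord y = 1 - x.
gap = 1.0 - k_norm - y_norm
elbow_k = ks[int(np.argmax(gap))]
print("suggested number of clusters:", elbow_k)
```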

Let's create the clusters and evaluate those produced by three different methods: agglomerative clustering, k-means (through mini-batch k-means), and spectral clustering.
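The three methods can be run side by side as sketched below. The data is synthetic, the number of clusters is fixed at 4 for the example, and the silhouette score is an assumed evaluation metric, since the text does not name the one that was used. The nearest-neighbors affinity for spectral clustering is likewise a choice made for this sketch.

```python
import numpy as np
from sklearn.cluster import (
    AgglomerativeClustering,
    MiniBatchKMeans,
    SpectralClustering,
)
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the three PCA columns.
X, _ = make_blobs(n_samples=300, centers=4, n_features=3, random_state=0)
n_clusters = 4  # assumed for this example

models = {
    "agglomerative": AgglomerativeClustering(n_clusters=n_clusters),
    "minibatch_kmeans": MiniBatchKMeans(
        n_clusters=n_clusters, n_init=10, random_state=0
    ),
    "spectral": SpectralClustering(
        n_clusters=n_clusters, affinity="nearest_neighbors", random_state=0
    ),
}

# Fit each model and keep its cluster label for every record.
labels = {name: model.fit_predict(X) for name, model in models.items()}

# Silhouette score: higher values indicate tighter, better separated clusters.
scores = {name: silhouette_score(X, lab) for name, lab in labels.items()}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```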

Compare Clusters to see if there is a meaningful difference

There does not seem to be a meaningful difference between the clustering methods. The personas derived from any one method should therefore match those from the others, so we only need to create one persona per cluster. We select Agglomerative Clustering to create the customer personas for the segments.
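One way to quantify how closely two clusterings agree is the adjusted Rand index, which ignores the arbitrary numbering of clusters and equals 1.0 when two labelings describe the same partition. The helper pairwise_ari and the tiny label lists below are hypothetical and only illustrate the comparison; this is not necessarily how the comparison above was carried out.

```python
from itertools import combinations

from sklearn.metrics import adjusted_rand_score

def pairwise_ari(labelings):
    """Adjusted Rand index for every pair of named labelings."""
    return {
        (a, b): adjusted_rand_score(labelings[a], labelings[b])
        for a, b in combinations(labelings, 2)
    }

# Toy labelings: the first two describe the same partition with the
# cluster ids swapped; the third moves one record to another cluster.
labelings = {
    "agglomerative": [0, 0, 1, 1, 2, 2],
    "minibatch_kmeans": [1, 1, 0, 0, 2, 2],
    "spectral": [0, 0, 1, 1, 2, 1],
}

ari = pairwise_ari(labelings)
for pair, value in ari.items():
    print(pair, round(value, 3))
```

Values close to 1.0 for every pair would support using a single set of personas regardless of which method produced the clusters.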